5. Experimental Results and Analysis
The agent was trained for a total of 24.5 million timesteps across 30,000 episodes, requiring 14.4 hours in the ALE/Pacman-v5 environment using RAM-based observations. This section presents a detailed analysis of the agent's learning progression, qualitative evaluation of emergent behaviors, and interpretation of policy decisions linked to the underlying RAM state.
Table 2 summarizes model performance across training episodes on a Colab L4 GPU.
| Episodes | Time (min) | Cumulative Time (hr) | Cumulative Steps | Avg Reward (Last 100 eps) | Avg Reward (Unshaped) | Eval Reward (100 eps) |
|---|---|---|---|---|---|---|
| 1–5000 | 126.27 | 2.10 | 3,590,291 | 66.60 | 138.40 | 23.93 |
| 5001–10000 | 131.67 | 4.30 | 7,359,393 | 197.66 | 273.05 | 344.82 |
| 10001–15000 | 144.32 | 6.70 | 11,482,844 | 286.32 | 368.79 | 409.53 |
| 15001–20000 | 155.22 | 9.29 | 15,879,247 | 337.27 | 425.20 | 381.54 |
| 20001–25000 | 157.55 | 11.92 | 20,276,052 | 377.02 | 464.96 | 518.19 |
| 25001–30000 | 151.42 | 14.44 | 24,511,479 | 411.19 | 495.90 | 506.12 |
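The per-block averages in Table 2 can be recomputed from the per-episode reward log. A minimal sketch, assuming rewards are available as a flat array; the `block_averages` helper and the synthetic data here are illustrative, not part of the training code:

```python
import numpy as np

def block_averages(rewards, block=5000):
    """Mean reward over each consecutive block of `block` episodes."""
    rewards = np.asarray(rewards, dtype=float)
    n_blocks = len(rewards) // block
    return [rewards[i * block:(i + 1) * block].mean() for i in range(n_blocks)]

# Synthetic stand-in for the 30,000 per-episode rewards logged during training.
rng = np.random.default_rng(0)
rewards = rng.normal(loc=300.0, scale=50.0, size=30_000)
print([round(m, 1) for m in block_averages(rewards)])  # six per-phase means
```

Applied to the real log, each entry corresponds to one "Avg Reward" row of Table 2.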
5.1 Training Progression and Learning Phases
The training process over 30,000 episodes can be divided into six phases, each characterized by distinct behavioral patterns, performance trends, and algorithmic dynamics, as shown in Figure 2.
Phase 1: Random Exploration (Episodes 1–5,000)
During this initial stage, the policy behaves nearly randomly, frequently resulting in collisions with ghosts and entrapment. This is reflected in low average rewards of 66.6 (shaped) and 138.4 (unshaped), with an evaluation reward of 23.93. High entropy regularization encourages broad exploration; however, no meaningful action-reward associations are yet established. This exploration phase is critical for later training, as it facilitates the discovery of actions that yield high rewards.
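The entropy regularization mentioned above can be sketched as follows. This is a minimal illustration of how a policy-entropy bonus enters a PPO loss, not the project's exact training code; the coefficient value is an assumed placeholder:

```python
import torch
from torch.distributions import Categorical

# Hypothetical policy logits for a batch of 4 states over 5 actions.
logits = torch.randn(4, 5)
dist = Categorical(logits=logits)

# Mean policy entropy over the batch; high entropy means near-uniform
# action selection, i.e. broad exploration as in Phase 1.
entropy = dist.entropy().mean()

# The entropy bonus is subtracted from the loss (so entropy is maximized),
# weighted by a small coefficient (0.01 here is an assumption).
ent_coef = 0.01
policy_loss = torch.tensor(0.0)  # stand-in for the clipped surrogate loss
total_loss = policy_loss - ent_coef * entropy
```

Annealing or simply outgrowing this bonus is what lets the later phases shift from exploration to exploitation.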
Phase 2: Rapid Policy Acquisition (Episodes 5,001–10,000)
During this phase, the agent learns fundamental strategies including ghost avoidance and pellet collection. Average rewards increase sharply to 197.7 (shaped) and 273.1 (unshaped), while evaluation rewards reach 344.8. This period is marked by strong advantage estimates driving substantial policy improvements.
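The advantage estimates driving these updates are typically computed with generalized advantage estimation (GAE); a minimal sketch with assumed gamma/lambda values (the project's exact estimator and hyperparameters are not shown in this section):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one episode.

    `values` holds one critic estimate per step plus a bootstrap value for
    the final state, so len(values) == len(rewards) + 1.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error at step t, then discounted accumulation backwards in time.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

adv = gae_advantages(rewards=[1.0, 0.0, 10.0], values=[0.5, 0.4, 0.3, 0.0])
```

Large positive advantages early in training (e.g. after an unexpected pellet streak) are what produce the sharp policy improvements seen in this phase.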
Phase 3: Strategic Consolidation (Episodes 10,001–15,000)
Navigation becomes more structured and efficient as the agent clears board sections methodically. Average rewards increase to 286.3 (shaped) and 368.8 (unshaped), with evaluation reward stabilizing near 409.5. PPO updates are moderate, reinforcing effective sequential behaviors.
Phase 4: Advanced Adaptation (Episodes 15,001–20,000)
The agent adopts higher-level tactics such as strategic power pellet use. Average rewards rise to 337.3 (shaped) and 425.2 (unshaped), while evaluation reward decreases slightly to 381.5, indicating occasional failed experiments and potential overfitting. The value function emphasizes long-term gains over immediate rewards.
Phase 5: Fine-Grained Optimization (Episodes 20,001–25,000)
Complex strategies emerge, including ghost herding and route optimization. Improvement slows as performance peaks with average rewards of 377.0 (shaped) and 465.0 (unshaped), and evaluation reward reaching 518.2. PPO clipping constrains updates, promoting refined improvements.
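The clipping referred to here is PPO's clipped surrogate objective, which bounds how far a single update can move the policy; a minimal sketch with an assumed clip range of 0.2:

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (returned as a loss to be minimized)."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum discards over-optimistic updates, which is
    # what slows improvement once the policy is already near-optimal.
    return -torch.min(unclipped, clipped).mean()

loss = clipped_surrogate_loss(
    new_log_probs=torch.tensor([-0.1, -2.0]),
    old_log_probs=torch.tensor([-0.5, -0.4]),
    advantages=torch.tensor([1.0, -1.0]),
)
```

Once the probability ratio saturates at 1 ± 0.2, the gradient through the clipped term vanishes, producing exactly the refined, small-step improvements described in this phase.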
Phase 6: Performance Plateau (Episodes 25,001–30,000)
Agent behavior stabilizes, leading to a performance plateau. Average rewards level at 411.2 (shaped) and 495.9 (unshaped), while evaluation rewards slightly decline to 506.1. This indicates convergence to a local optimum, implying further improvements may need alternative exploration strategies or algorithmic changes.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def plot_results(results):
    """Plot per-episode rewards with a 100-episode moving average and phase markers."""
    line_color_avg = '#3B596A'
    line_color_raw = '#A3D1C8'
    grid_color = '#D3D3D3'
    text_color = '#4A4A4A'
    accent_color = '#3B596A'

    rewards_list = results
    # 100-episode moving average (empty if fewer than 100 episodes).
    moving_average = ([np.mean(rewards_list[j:j + 100]) for j in range(len(rewards_list) - 99)]
                      if len(rewards_list) >= 100 else [])

    fig, ax = plt.subplots(figsize=(14, 8))
    ax.plot(rewards_list, alpha=0.4, color=line_color_raw, label='Episode Reward')
    if moving_average:
        ax.plot(range(99, len(rewards_list)), moving_average, color=line_color_avg,
                linestyle='-', alpha=0.8, lw=2.5, label='100-Episode Moving Average')

    # Annotate the six training phases described in Section 5.1.
    phase_intervals = {
        (1, 5000): "Random \nExploration",
        (5001, 10000): "Rapid Policy\n Acquisition",
        (10001, 15000): "Strategic \nConsolidation",
        (15001, 20000): "Advanced \nAdaptation",
        (20001, 25000): "Fine-Grained \nOptimization",
        (25001, 30000): "Performance \nPlateau"
    }
    for (start, end), phase_name in phase_intervals.items():
        midpoint = (start + end) / 2
        ax.text(midpoint, 680, phase_name, color=accent_color, fontsize=14,
                rotation=0, fontweight='bold', ha='center')
        ax.axvline(x=end, color=accent_color, linestyle='--', lw=1.5, alpha=0.8)

    ax.set_title("Training: pacman_ppo_model | Random Seed: 42",
                 fontsize=18, fontweight='bold', pad=15, color=text_color)
    ax.set_xlabel("Episode", fontsize=16, color=text_color)
    ax.set_ylabel("Reward", fontsize=16, color=text_color)
    ax.set_ylim(-100, 1020)
    ax.grid(True, axis='x', color=grid_color, linestyle='-', lw=1, alpha=0.6)
    for spine in ['top', 'right']:
        ax.spines[spine].set_visible(False)
    for spine in ['left', 'bottom']:
        ax.spines[spine].set_color(grid_color)
    ax.tick_params(axis='x', colors=text_color, labelsize=14)
    ax.tick_params(axis='y', colors=text_color, labelsize=14)
    ax.set_yticks(np.arange(-100, 1001, 100))

    legend = ax.legend(fontsize=14, loc='upper left')
    legend.get_frame().set_edgecolor(grid_color)
    for text in legend.get_texts():
        text.set_color(text_color)

    # plt.savefig("pacman_ppo_model_training_plot.png")
    # print("Saved plot to pacman_ppo_model_training_plot.png")
    plt.tight_layout()
    plt.show()

df = pd.read_csv('pacman_ppo_model_reward_ep30000.csv', index_col=0)
plot_results(df['reward'])
```
5.2 Final Model Evaluation
The final PPO policy was evaluated over 100 episodes to assess generalization. Unshaped training rewards reached 495.9 by episode 30,000, and evaluation rewards averaged 506.12, indicating stable performance. The highest recorded evaluation score was 812 points, as shown in Figure 3. Some low-value wafers, each worth only one point, typically remain uncollected: rather than clearing them, the agent takes risks to consume the vitamin, which is worth 100 points. This behavior suggests that further reward shaping could better guide the agent toward clearing the maze when that is the desired objective.
Despite training rewards increasing between episodes 25,000 and 30,000, evaluation rewards slightly declined from 518.19 to 506.12, suggesting potential overfitting and limited generalization. Qualitative analysis confirms effective power pellet use, ghost avoidance, and pellet collection, though occasional failures in complex scenarios highlight areas for improvement. Overall, results demonstrate strong task proficiency and validate PPO's suitability for medium-complexity environments while underscoring challenges in sustained generalization.
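One way the shaping suggested above could be expressed is a small bonus added to each positive step reward, nudging the policy toward clearing low-value wafers rather than only chasing the 100-point vitamin. A hypothetical sketch; the `wafer_bonus` helper and its bonus value are assumptions, not the shaping used during training:

```python
def wafer_bonus(reward, bonus=2.0):
    """Hypothetical shaping: add a small bonus to every positive step reward.

    In practice this would be applied inside a gymnasium.RewardWrapper
    around the training environment; only the raw reward is reported
    as the unshaped score.
    """
    return reward + bonus if reward > 0 else reward
```

The risk with any such bonus is reward hacking, so shaped and unshaped returns should be tracked separately, as Table 2 already does.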
```python
from IPython.display import display, Image
display(Image("/content/Figure3.jpg", width=600, height=400))
```
```python
# !pip install gymnasium[atari]
# !pip install ale-py
import ale_py
import gymnasium as gym
import matplotlib.animation as animation
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from IPython.display import HTML

class PolicyNet(nn.Module):
    """The policy network (Actor) for selecting actions."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super(PolicyNet, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, state):
        """Returns logits for each action."""
        return self.network(state)

class EvaluationAgent:
    """Agent to load and evaluate a trained PPO model."""
    def __init__(self, state_dim, action_dim, model_path, hidden_dim=256):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.actor = PolicyNet(state_dim, action_dim, hidden_dim).to(self.device)
        checkpoint = torch.load(model_path, map_location=self.device)
        self.actor.load_state_dict(checkpoint['actor_state_dict'])
        self.actor.eval()
        print(f"Model loaded successfully from {model_path}")

    def select_action(self, state):
        """Selects the best action deterministically (greedy over logits)."""
        with torch.no_grad():
            # Normalize RAM bytes (0-255) to [0, 1], matching training.
            state_tensor = torch.FloatTensor(state / 255.0).to(self.device)
            logits = self.actor(state_tensor)
            action = torch.argmax(logits).item()
        return action

def set_seed(seed, env=None):
    """Sets random seeds for reproducibility."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    if env is not None:
        env.reset(seed=seed)
        env.action_space.seed(seed)
        env.observation_space.seed(seed)

def evaluate_pacman_model(model_path, seed, num_episodes=100, hidden_dim=256):
    """Evaluates a PPO agent on the Pac-Man environment."""
    gym.register_envs(ale_py)
    env = gym.make("ALE/Pacman-v5", obs_type='ram', render_mode='rgb_array', mode=4)
    set_seed(seed, env)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    agent = EvaluationAgent(state_dim, action_dim, model_path, hidden_dim)

    episode_rewards = []
    best_reward = -float('inf')
    best_frames = []
    print(f"\nRunning evaluation for {num_episodes} episodes with seed {seed}...")
    for episode in range(num_episodes):
        state, _ = env.reset(seed=seed + episode)
        done, total_reward = False, 0
        frames = []
        while not done:
            frames.append(env.render())
            action = agent.select_action(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
        episode_rewards.append(total_reward)
        # Keep the frames of the best-scoring episode for the animation below.
        if total_reward > best_reward:
            best_reward = total_reward
            best_frames = frames
        if (episode + 1) % 10 == 0:
            print(f"--> Episode {episode + 1}/{num_episodes} | "
                  f"Average reward (last 10): {np.mean(episode_rewards[-10:]):.2f}")
    env.close()

    avg_score = np.mean(episode_rewards)
    std_dev = np.std(episode_rewards)
    min_reward = np.min(episode_rewards)
    print("\n" + "=" * 80)
    print("Evaluation Finished")
    print(f"Average Score: {avg_score:.2f} +/- {std_dev:.2f}")
    print(f"Best Reward: {best_reward:.2f}")
    print(f"Min Reward: {min_reward:.2f}")
    print("=" * 80)
    return {
        "average_score": avg_score,
        "std_dev": std_dev,
        "best_reward": best_reward,
        "min_reward": min_reward,
        "best_frames": best_frames,
        "all_rewards": episode_rewards
    }

seed = 123
model_paths = ['pacman_ppo_model_ep5000.pth', 'pacman_ppo_model_ep10000.pth',
               'pacman_ppo_model_ep15000.pth', 'pacman_ppo_model_ep20000.pth',
               'pacman_ppo_model_ep25000.pth', 'pacman_ppo_model_ep30000.pth']
for model in model_paths:
    results = evaluate_pacman_model(model_path=model, seed=seed, num_episodes=100, hidden_dim=256)
```
The evaluation output for the six checkpoints is summarized below (100 episodes each, seed 123):

| Checkpoint | Average Score | Std Dev | Best Reward | Min Reward |
|---|---|---|---|---|
| pacman_ppo_model_ep5000.pth | 23.93 | 5.02 | 38.00 | 20.00 |
| pacman_ppo_model_ep10000.pth | 344.82 | 8.89 | 386.00 | 335.00 |
| pacman_ppo_model_ep15000.pth | 409.53 | 50.19 | 639.00 | 273.00 |
| pacman_ppo_model_ep20000.pth | 381.54 | 68.28 | 790.00 | 345.00 |
| pacman_ppo_model_ep25000.pth | 518.19 | 112.59 | 802.00 | 345.00 |
| pacman_ppo_model_ep30000.pth | 506.12 | 132.37 | 812.00 | 344.00 |
```python
fig, ax = plt.subplots(figsize=(8, 6))
ax.axis('off')
im = ax.imshow(results['best_frames'][0])

def animate(i):
    im.set_array(results['best_frames'][i])
    return [im]

anim = animation.FuncAnimation(fig, animate, frames=len(results['best_frames']), interval=50)
## Save the animation as an mp4 file
# anim.save(f"pacman_evaluation_seed{seed}_{results['best_reward']}.mp4", writer='ffmpeg', fps=20)
display(HTML(anim.to_jshtml()))
plt.close(fig)
```